<html>
<head>
  <title>Non-uniform file encodings in the Eclipse Platform</title>
  <link rel="stylesheet" href="http://dev.eclipse.org/default_style.css" type="text/css">
</head>
<body text="#000000" bgcolor="#FFFFFF">
<h1>Non-uniform file encodings in the Eclipse Platform</h1>
<p><font size="-1">Last modified: June 12, 2003</font> </p>
  <cite><strong>Plan item description:</strong> Eclipse 2.1 uses 
    a single global file encoding setting for reading and writing files in the 
    workspace. This is problematic; for example, when Java source files in the 
    workspace use OS default file encoding while XML files in the workspace use 
    UTF-8 file encoding. The Platform should support non-uniform file encodings. 
    [Platform Core, Platform UI, Text, Search, Compare, JDT UI, JDT Core] [Theme: 
    User experience] (bug <a href="http://bugs.eclipse.org/bugs/show_bug.cgi?id=37933">37933</a>, 
    <a href="http://dev.eclipse.org/bugs/show_bug.cgi?id=5399">5399</a>) </cite>
<p>The current situation is as follows:</p>
<ul>
  <li><code>ResourcesPlugin.getEncoding</code> returns the default encoding for 
    the workspace (the <code>org.eclipse.core.resources.encoding</code> preference 
    value if available, otherwise the value of the <code>file.encoding</code> 
    Java system property).</li>
  <li><code>IFile.getContents</code>/<code>setContents</code> work with byte streams 
    - no encoding can be applied.</li>
  <li><code>IFile.getEncoding</code> tries to guess the file encoding (looking 
    for the <a href="http://www.unicode.org/unicode/faq/utf_bom.html">Byte Order 
    Mark</a>), which is not enough. Also, this API has no known client 
    so far. This API method would be deprecated.</li>
  <li>the Java compiler supports non-uniform encondings for Java source files, 
    but in Eclipse it relies on <code>ResourcesPlugin.getEncoding (same value 
    for all sources)</code>.</li>
  <li>the text editor framework supports setting the encoding for files being 
    edited (setting a persistent property on the file resource), but there is 
    no support for setting the encoding of multiple files simultaneously, and 
    other components are not aware of the encoding settings.</li>
</ul>
<h2>Requirements</h2>
<ul>
  <li>users should be able to set the encoding for a file or a default encoding 
    for a container (folders, projects, the workspace root) and its children.</li>
  <li>users should be able to share encoding settings in a team repository (metadata 
    should reside in the project content area).</li>
  <li>file-specific encoding set by users prevails upon file contents-based encoding.</li>
  <li>file contents-based encoding prevails upon the inherited 
    encoding setting.</li>
</ul>
<h2>Proposed solution</h2>
<p>The encoding for a resource (as returned by <code>IResource.getCharset</code> 
  - see API changes) will be:</p>
<ol>
  <li> the encoding explictly set by a client/user (with <code>IResource.setCharset</code> - 
    see API changes), if any, or</li>
  <li> for a file resource, the encoding discovered by an <em>encoding interpreter</em> 
    associated to the file extension, if one exists <em>and</em> can determine 
    the encoding, or</li>
  <li> for a file resource, the file encoding determined by its Byte Order Mark, 
    if it exists, or</li>
  <li> the resource parent's encoding (except for the workspace root, whose encoding 
    is equivalent to ResourcesPlugin.getEncoding()).</li>
</ol>
<p>Regarding #2, an extension-point would allow file format-aware encoding interpreters 
  to register to the encoding discovery mechanism for specific file types (extensions) 
  or to associate existing encoding interpreters to their own file extensions. 
  Users would be able to associate more file extensions for the known interpreters 
  (preference).</p>
<p>All clients, when creating character-based streams when reading/writing the 
  contents of a file resource, should pass along the charset string obtained from 
  <code>IFile.getCharset</code> instead of the one provided by <code>ResourcesPlugin.getEncoding</code>. 
  Examples are: text editors, compiler, search, compare. </p>
<p>Also, setting the encoding for a resource would generate a resource change 
  event, but only for the directly affected resource (if clients are interested 
  on what effects the change in a directory had on files inside it, they will 
  have to find it out by themselves). </p>
<h3>API changes</h3>
<h4>Added:</h4>
<p><code>public void IResource.setCharset(String charsetName) throws CoreException</code></p>
<p>Sets the charset name for this resource. May be <code>null</code>, which sets 
  it to default. For the workspace root, it sets the workspace's default encoding 
  preference to the charset's canonical name (or to the default encoding, if <code>null</code> 
  was provided). </p>
<p><code>public String IResource.getCharset() throws CoreException</code></p>
<p>Returns the name of the charset for this resource. For files, if none has been 
  defined (with <code>setCharset</code>), returns the default charset. To determine 
  the default charset, it tries to guess it by a) inspecting the file contents 
  (BOM), b) calling the corresponding encoding interpreter (if any). Otherwise, 
  the parent's charset is returned. For the workspace root, a charset corresponding 
  to the workspace's default encoding preference is returned.</p>
<p><code>public boolean IResource.isDefaultCharset() throws CoreException</code></p>
<p>Returns <code>true</code> if the currently configured charset was not explicitly 
set by the user - (has a default value either guessed by file contents, or inherited from parent).</p>
<p><code>public static final int IResourceDelta.ENCODING = 0x100000;</code></p>
<p><code>public String IResourceDelta.getNewCharset();</code></p>
<p><code>public String IResourceDelta.getOldCharset();</code></p>
<p>For notifying changes in file encodings. Both methods should only be called 
  only valid when <code>getKind()==CHANGE</code>, and <code>(getFlags()&amp;ENCODING)!=0</code>.</p>
<pre>public interface IEncodingInterpreter {
	/** returns null if the charset cannot be determined. */
	public String interpretCharset(java.io.InputStream input);
}</pre>
<p>Encoding interpreters will be associated to file types through a new core resources 
  extension point. Users can associate additional file extensions ia preferences.</p>
<p>The platform would provide itself implementations for xml and other popular 
  (?) file formats.</p>
<h4>Deprecated:</h4>
<pre>public int IFile.getEncoding()
public int IFile.ENCODING_* constants</pre>
<h3>Encoding settings metadata</h3>
<p>The encoding settings metadata will be stored inside the project's content 
  area so it can be easily shared.</p>
<h3>Scenarios</h3>
<ul>
  <li>The user opens in an editor a text file whose contents where created using 
    encoding &quot;MS932&quot; in a workspace whose default encoding is &quot;US-ASCII&quot;. 
    It was not possible to guess the file encoding automatically, so what the 
    user sees is gibberish. The user figures out the cause of the problem and 
    expliclty sets the encoding for that specific file to be &quot;MS932&quot;. 
    The editor will get notified and might offer to the user the option of reloading 
    the file contents.</li>
  <li>The user gets the source code for a set of classes he/she needs to use, 
    but the classes do not compile because the author used internal Java identifiers 
    not supported in the user workspace's current encoding. The user then selects 
    the offending source files, and apply to them the correct encoding. The encoding 
    change in the affected files will be reported in subsequent resource change 
    events (to listeners and builders). Builders may recompute build state affected 
    by changed encodings. Views depending on the contents of the affected files 
    may decide to reload the contents using the new encoding. </li>
  <li> If the user changes the encoding for a bunch of <em>directories</em>, only 
    the directly affected resources will appear in the delta. Clients may want 
    to re-read/re-build the contents of files whose parent changed encoding. Otherwise, 
    the user will have to trigger a full build in the affected project. Refresh 
    will not help.</li>
  <li>The user moves a text file with default encoding to a directory which has 
    a different encoding than the previous parent (which means the encoding for 
    the file has changed). No encoding changes will be reported. Clients may want 
    to re-read the file when it is moved if it reports a different encoding than 
    the one originally used to load its contents.</li>
</ul>
</body></html>
